496 research outputs found

Weighted Random Sampling - Alias Tables on the GPU

Author: Lehmann Hans-Peter
Publication venue: Karlsruher Institut für Technologie
Publication date: 28/05/2021

Sliding Block Hashing (Slick) -- Basic Algorithmic Ideas

Author: Lehmann Hans-Peter
Sanders Peter
Walzer Stefan
Publication date: 18/04/2023

We present Sliding Block Hashing (Slick), a simple hash table data structure that combines high performance with very good space efficiency. This preliminary report outlines avenues for analysis and implementation that we intend to pursue.

arXiv.org e-Print Archive

ShockHash: Towards Optimal-Space Minimal Perfect Hashing Beyond Brute-Force

Author: Lehmann Hans-Peter
Sanders Peter
Walzer Stefan
Publication date: 13/11/2023

A minimal perfect hash function (MPHF) maps a set $S$ of $n$ keys to the first $n$ integers without collisions. There is a lower bound of $n\log_2 e - O(\log n)$ bits of space needed to represent an MPHF. A matching upper bound is obtained using the brute-force algorithm that tries random hash functions until stumbling on an MPHF and stores that function's seed. In expectation, $e^n\,\textrm{poly}(n)$ seeds need to be tested. The most space-efficient previous algorithms for constructing MPHFs all use such a brute-force approach as a basic building block. In this paper, we introduce ShockHash - Small, heavily overloaded cuckoo hash tables. ShockHash uses two hash functions $h_0$ and $h_1$, hoping for the existence of a function $f : S \rightarrow \{0,1\}$ such that $x \mapsto h_{f(x)}(x)$ is an MPHF on $S$. In graph terminology, ShockHash generates $n$-edge random graphs until stumbling on a pseudoforest - a graph where each component contains as many edges as nodes. Using cuckoo hashing, ShockHash then derives an MPHF from the pseudoforest in linear time. It uses a 1-bit retrieval data structure to store $f$ using $n + o(n)$ bits. By carefully analyzing the probability that a random graph is a pseudoforest, we show that ShockHash needs to try only $(e/2)^n\,\textrm{poly}(n)$ hash function seeds in expectation, reducing the space for storing the seed by roughly $n$ bits. This makes ShockHash almost a factor $2^n$ faster than brute-force, while maintaining the asymptotically optimal space consumption. An implementation within the RecSplit framework yields the currently most space efficient MPHFs, i.e., competing approaches need about two orders of magnitude more work to achieve the same space.
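The seed search described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the hash helper `h`, the function names, and the plain dictionary standing in for the 1-bit retrieval data structure are all assumptions made for illustration, and the keys are assumed to be distinct. The sketch tries seeds until the graph with edges $\{h_0(x), h_1(x)\}$ is a pseudoforest, then uses cuckoo-style evictions to pick $f(x)$ for every key.

```python
import hashlib


def h(key, seed, which, n):
    """Hypothetical stand-in for h_0 / h_1: hash (seed, which, key) into range(n)."""
    data = f"{seed}:{which}:{key}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") % n


def is_pseudoforest(edges, n):
    """True if no connected component has more edges than nodes."""
    parent = list(range(n))
    nodes = [1] * n      # node count per component root
    edge_cnt = [0] * n   # edge count per component root

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # merge the two components under ra
            parent[rb] = ra
            nodes[ra] += nodes[rb]
            edge_cnt[ra] += edge_cnt[rb]
        edge_cnt[ra] += 1
        if edge_cnt[ra] > nodes[ra]:
            return False
    return True


def orient(keys, seed, n):
    """Cuckoo-style insertion; returns {key: f(key)} with all slots distinct."""
    slot = [None] * n  # slot[i] = (key, choice) currently stored at position i
    for key in keys:
        cur, choice = key, 0
        for _ in range(4 * n + 4):  # generous bound on the eviction walk
            pos = h(cur, seed, choice, n)
            if slot[pos] is None:
                slot[pos] = (cur, choice)
                break
            evicted, evicted_choice = slot[pos]
            slot[pos] = (cur, choice)
            cur, choice = evicted, 1 - evicted_choice  # evicted key tries its other slot
        else:
            raise RuntimeError("eviction walk did not terminate")
    return {k: c for k, c in slot}


def build(keys, max_seeds=1_000_000):
    """Try seeds 0, 1, 2, ... until the induced graph is a pseudoforest."""
    n = len(keys)
    for seed in range(max_seeds):
        edges = [(h(k, seed, 0, n), h(k, seed, 1, n)) for k in keys]
        if is_pseudoforest(edges, n):
            return seed, orient(keys, seed, n)
    raise RuntimeError("no suitable seed found")
```

With `seed, f = build(keys)`, the value `h(x, seed, f[x], len(keys))` gives a distinct position for each key `x`. Because the graph has exactly $n$ edges and $n$ nodes, the check "no component has more edges than nodes" is equivalent to every component having as many edges as nodes.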

arXiv.org e-Print Archive

SicHash -- Small Irregular Cuckoo Tables for Perfect Hashing

Author: Lehmann Hans-Peter
Sanders Peter
Walzer Stefan
Publication date: 08/11/2022

A Perfect Hash Function (PHF) is a hash function that has no collisions on a given input set. PHFs can be used for space efficient storage of data in an array, or for determining a compact representative of each object in the set. In this paper, we present the PHF construction algorithm SicHash - Small Irregular Cuckoo Tables for Perfect Hashing. At its core, SicHash uses a known technique: It places objects in a cuckoo hash table and then stores the final hash function choice of each object in a retrieval data structure. We combine the idea with irregular cuckoo hashing, where each object has a different number of hash functions. Additionally, we use many small tables that we overload beyond their asymptotic maximum load factor. The most space efficient competitors often use brute force methods to determine the PHFs. SicHash provides a more direct construction algorithm that only rarely needs to recompute parts. Our implementation improves the state of the art in terms of space usage versus construction time for a wide range of configurations. At the same time, it provides very fast queries

arXiv.org e-Print Archive

KITopen

Bipartite ShockHash: Pruning ShockHash Search for Efficient Perfect Hashing

Author: Lehmann Hans-Peter
Sanders Peter
Walzer Stefan
Publication date: 23/10/2023

A minimal perfect hash function (MPHF) maps a set of $n$ keys to the first $n$ integers without collisions. Representing this bijection needs at least $\log_2(e) \approx 1.443$ bits per key, and there is a wide range of practical implementations achieving about 2 bits per key. Minimal perfect hashing is a key ingredient in many compact data structures such as updatable retrieval data structures and approximate membership data structures. A simple implementation reaching the space lower bound is to sample random hash functions using brute-force, which needs about $e^n \approx 2.718^n$ tries in expectation. ShockHash recently reduced that to about $(e/2)^n \approx 1.359^n$ tries in expectation by sampling random graphs. With bipartite ShockHash, we now sample random bipartite graphs. In this paper, we describe the general algorithmic ideas of bipartite ShockHash and give an experimental evaluation. The key insight is that we can try all combinations of two hash functions, each mapping into one half of the output range. This reduces the number of sampled hash functions to only about $(\sqrt{e/2})^n \approx 1.166^n$ in expectation. In itself, this does not reduce the asymptotic running time much because all combinations still need to be tested. However, by filtering the candidates before combining them, we can reduce this to less than $1.175^n$ combinations in expectation. Our implementation of bipartite ShockHash is up to 3 orders of magnitude faster than original ShockHash. Inside the RecSplit framework, bipartite ShockHash-RS enables significantly larger base cases, leading to a construction that is, depending on the allotted space budget, up to 20 times faster. In our most extreme configuration, ShockHash-RS can build an MPHF for 10 million keys with 1.489 bits per key (within 3.3% of the lower bound) in about half an hour, pushing the limits of what is possible.
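The constants quoted in this abstract can be reproduced with a few lines of arithmetic. The snippet below is only a numerical check; the function names `search_bases` and `space_overhead` are made up for illustration.

```python
import math


def search_bases():
    """Per-key growth bases of the expected number of tries, as quoted in the abstract."""
    return {
        "lower_bound_bits_per_key": math.log2(math.e),      # about 1.443
        "brute_force": math.e,                               # about 2.718
        "shockhash": math.e / 2,                             # about 1.359
        "bipartite_sampled": math.sqrt(math.e / 2),          # about 1.166
    }


def space_overhead(bits_per_key):
    """Relative overhead of a given space usage over the log2(e) lower bound."""
    return bits_per_key / math.log2(math.e) - 1.0


if __name__ == "__main__":
    for name, value in search_bases().items():
        print(f"{name}: {value:.3f}")
    print(f"overhead at 1.489 bits/key: {space_overhead(1.489):.2%}")
```

Since $\sqrt{e/2}^{\,2} = e/2$, combining every sampled function for one half of the range with every function for the other half covers the same number of candidates as ShockHash, which is why the abstract stresses filtering before combining.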

arXiv.org e-Print Archive

High Performance Construction of RecSplit Based Minimal Perfect Hash Functions

Author: Bez Dominik
Kurpicz Florian
Lehmann Hans-Peter
Sanders Peter
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual European Symposium on Algorithms (ESA 2023)
Publication date: 01/01/2023

A minimal perfect hash function (MPHF) bijectively maps a set S of objects to the first |S| integers. It can be used as a building block in databases and data compression. RecSplit [Esposito et al., ALENEX'20] is currently the most space efficient practical minimal perfect hash function. It heavily relies on trying out hash functions in a brute force way. We introduce rotation fitting, a new technique that makes the search more efficient by drastically reducing the number of tried hash functions. Additionally, we greatly improve the construction time of RecSplit by harnessing parallelism on the level of bits, vectors, cores, and GPUs. In combination, the resulting improvements yield speedups up to 239 on an 8-core CPU and up to 5438 using a GPU. The original single-threaded RecSplit implementation needs 1.5 hours to construct an MPHF for 5 million objects with 1.56 bits per object. On the GPU, we achieve the same space usage in just 5 seconds. Given that the speedups are larger than the increase in energy consumption, our implementation is more energy efficient than the original implementation.

Dagstuhl Research Online Publication Server

High Performance Construction of RecSplit Based Minimal Perfect Hash Functions

Author: Bez Dominik
Kurpicz Florian
Lehmann Hans-Peter
Sanders Peter
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 12/09/2023

KITopen

Learned Monotone Minimal Perfect Hashing

Author: Ferragina Paolo
Lehmann Hans-Peter
Sanders Peter
Vinciguerra Giorgio
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 12/09/2023

A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next best competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.
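The core idea in this abstract (a monotone rank estimate plus small per-key integers that resolve collisions) can be shown with a toy sketch. It rests on simplifying assumptions that are not part of LeMonHash itself: a single linear model replaces the PGM-index, a Python dictionary replaces the BuRR retrieval data structure, bucket offsets are kept in a plain list, and keys are assumed to be distinct integers. The name `build_mmphf` is hypothetical.

```python
from itertools import accumulate


def build_mmphf(keys):
    """Toy monotone minimal perfect hash: returns a function mapping each key to its rank."""
    keys = sorted(keys)
    n = len(keys)
    lo, hi = keys[0], keys[-1]
    span = max(hi - lo, 1)

    def bucket(k):
        # Monotone rank estimate in range(n); a single linear model (assumption).
        return (k - lo) * (n - 1) // span

    sizes = [0] * n
    local = {}  # stand-in for a retrieval structure: key -> rank within its bucket
    for k in keys:
        b = bucket(k)
        local[k] = sizes[b]
        sizes[b] += 1

    # offsets[b] = number of keys that fall into buckets smaller than b
    offsets = [0] + list(accumulate(sizes))[:-1]

    def rank(k):
        # Only meaningful for keys of the input set; other keys raise KeyError here,
        # whereas a real retrieval structure would return an arbitrary value.
        return offsets[bucket(k)] + local[k]

    return rank
```

Because `bucket` is monotone, every key in a smaller bucket has a smaller rank, so the bucket offset plus the stored local rank gives the exact rank. The better the model fits the data, the fewer keys share a bucket and the smaller the stored integers become.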

KITopen

Learned Monotone Minimal Perfect Hashing

Author: Ferragina Paolo
Lehmann Hans-Peter
Sanders Peter
Vinciguerra Giorgio
Publication date: 21/04/2023

A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next best competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.

arXiv.org e-Print Archive

Learned Monotone Minimal Perfect Hashing

Author: Ferragina Paolo
Lehmann Hans-Peter
Sanders Peter
Vinciguerra Giorgio
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual European Symposium on Algorithms (ESA 2023)
Publication date: 01/01/2023

Dagstuhl Research Online Publication Server